Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection
Abstract
There has been significant recent work on the theory and application of randomized coordinate descent algorithms, beginning with the work of Nesterov [SIAM J. Optim., 22(2), 2012], who showed that a random-coordinate selection rule achieves the same convergence rate as the Gauss-Southwell selection rule. This result suggests that we should never use the Gauss-Southwell rule, because it is typically much more expensive than random selection. However, the empirical behaviour of these algorithms contradicts this theoretical result: in applications where the computational costs of the selection rules are comparable, the Gauss-Southwell selection rule tends to perform substantially better than random coordinate selection. We give a simple analysis of the Gauss-Southwell rule showing that, except in extreme cases, its convergence rate is faster than choosing random coordinates. We also (i) show that exact coordinate optimization improves the convergence rate for certain sparse problems, (ii) propose a Gauss-Southwell-Lipschitz rule that gives an even faster convergence rate given knowledge of the Lipschitz constants of the partial derivatives, (iii) analyze the effect of approximate Gauss-Southwell rules, and (iv) analyze proximal-gradient variants of the Gauss-Southwell rule.

1 Coordinate Descent Methods

There has been substantial recent interest in applying coordinate descent methods to solve large-scale optimization problems, starting with the seminal work of Nesterov [2012], who gave the first global rate-of-convergence analysis for coordinate-descent methods for minimizing convex functions. This analysis suggests that choosing a random coordinate to update gives the same performance as choosing the “best” coordinate to update via the more expensive Gauss-Southwell (GS) rule. (Nesterov also proposed a more clever randomized scheme, which we consider later in this paper.)
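The two selection rules being compared are easy to state concretely. Below is a minimal sketch (not from the paper's experiments; the quadratic test function, the function name, and the constant 1/L_i step size are illustrative assumptions) contrasting random selection with the GS rule on a smooth strongly convex quadratic:

```python
import numpy as np

def coordinate_descent(A, b, steps=500, rule="gs", seed=0):
    """Minimize f(x) = 0.5 * x^T A x - b^T x, with A symmetric positive
    definite, by single-coordinate updates with step size 1/L_i,
    where L_i = A[i, i] is the coordinate-wise Lipschitz constant."""
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    for _ in range(steps):
        grad = A @ x - b                 # full gradient (cheap here; for illustration only)
        if rule == "gs":
            # Gauss-Southwell: update the coordinate with the largest |partial derivative|
            i = int(np.argmax(np.abs(grad)))
        else:
            # randomized rule: pick a coordinate uniformly at random
            i = int(rng.integers(n))
        x[i] -= grad[i] / A[i, i]        # constant 1/L_i step along coordinate i
    return x
```

For a quadratic, the 1/L_i step is also the exact minimizer along coordinate i, so this sketch doubles as the "exact coordinate optimization" update; in practice the GS variant typically reaches a given suboptimality in fewer iterations, which is the empirical gap the paper sets out to explain.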
This result gives a compelling argument to use randomized coordinate descent in contexts where the GS rule is too expensive. It also suggests that there is no benefit to using the GS rule in contexts where it is relatively cheap. But in these contexts, the GS rule often substantially outperforms randomized coordinate selection in practice. This suggests that either the analysis of GS is not tight, or that there exists a class of functions for which the GS rule is as slow as randomized coordinate descent. After discussing contexts in which it makes sense to use coordinate descent and the GS rule, we answer this theoretical question by giving a tighter analysis of the GS rule (under strong-convexity and standard smoothness assumptions) that yields the same rate as the randomized method for a restricted class of functions, but is otherwise faster (and in some cases substantially faster). We further show that, compared to the usual constant step-size update of the coordinate, the GS method with exact coordinate optimization has a provably faster rate for problems satisfying a certain sparsity constraint (Section 5). We believe that this is the first result showing a theoretical benefit of exact coordinate optimization; all previous analyses show that these strategies obtain the same rate as constant step-size updates, even though exact optimization tends to be faster in practice. Furthermore, in Section 6, we propose a variant of the GS rule that, similar to Nesterov’s more clever randomized sampling scheme, uses knowledge of the Lipschitz constants of the coordinate-wise gradients to obtain a faster rate. We also analyze approximate GS rules (Section 7), which provide an intermediate strategy between randomized methods and the exact GS rule. Finally, we analyze proximal-gradient variants of the GS rule (Section 8) for optimizing problems that include a separable non-smooth term.
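The Lipschitz-aware variant of Section 6 changes only the selection step: rather than ranking coordinates by |∇_i f(x)| alone, it scales each by the corresponding coordinate-wise Lipschitz constant. A small sketch, assuming the selection criterion argmax_i |∇_i f(x)| / √L_i (the helper names are our own):

```python
import numpy as np

def gs_coordinate(grad):
    """Plain Gauss-Southwell rule: coordinate with the largest |partial derivative|."""
    return int(np.argmax(np.abs(grad)))

def gsl_coordinate(grad, L):
    """Gauss-Southwell-Lipschitz rule: weight each |partial derivative| by
    1/sqrt(L_i), where L_i is the Lipschitz constant of the i-th partial
    derivative. A large gradient on a coordinate with a large L_i only
    permits a small step, so this rule prefers the coordinate with the
    largest guaranteed decrease |grad_i|^2 / (2 * L_i)."""
    return int(np.argmax(np.abs(grad) / np.sqrt(L)))
```

For example, with grad = [3, -4] and L = [1, 16], the plain GS rule picks coordinate 1 (larger derivative), while the Lipschitz-weighted rule picks coordinate 0, whose cheaper curvature allows a larger step.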
2 Problems of Interest

The rates of Nesterov show that coordinate descent can be faster than gradient descent in cases where, if we are optimizing n variables, the cost of performing n coordinate updates is similar to the cost of performing one full gradient iteration. This essentially means that coordinate descent methods are useful for minimizing convex functions that can be expressed in one of the following two forms:
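The excerpt cuts off before listing the two forms, but the cost argument itself can be illustrated on least squares, f(x) = ½‖Ax − b‖², provided we maintain the residual r = Ax − b between updates. The function below is an illustrative sketch (our own, not from the paper), showing why n coordinate updates then cost roughly one full gradient evaluation:

```python
import numpy as np

def cd_least_squares(A, b, steps=1000, seed=0):
    """Random-coordinate descent on f(x) = 0.5 * ||Ax - b||^2, maintaining
    the residual r = Ax - b so that each coordinate update touches only one
    column of A. A single update then costs O(m) (or O(nnz) of that column
    if A is sparse), so n updates cost about as much as one full gradient
    A^T (Ax - b)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    r = -b.astype(float).copy()        # residual Ax - b at x = 0
    L = (A ** 2).sum(axis=0)           # coordinate-wise Lipschitz constants ||A[:, i]||^2
    for _ in range(steps):
        i = int(rng.integers(n))
        g_i = A[:, i] @ r              # partial derivative grad_i f(x), O(m) work
        delta = -g_i / L[i]            # exact minimization along coordinate i
        x[i] += delta
        r += delta * A[:, i]           # keep the residual consistent, O(m) work
    return x
```

Without the maintained residual, each coordinate update would need the full product Ax and coordinate descent would lose its cost advantage over gradient descent.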
Similar Articles
Let’s Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence
Block coordinate descent (BCD) methods are widely used for large-scale numerical optimization because of their cheap iteration costs, low memory requirements, amenability to parallelization, and ability to exploit problem structure. Three main algorithmic choices influence the performance of BCD methods: the block partitioning strategy, the block selection rule, and the block update rule. In th...
Supplementary materials for "Parallel Dual Coordinate Descent Method for Large-scale Linear Classification in Multi-core Environments"
f(α) ≡ g(Eα) + b^T α, where f(·) and g(·) are proper closed functions, E is a constant matrix, and L_i ∈ [−∞, ∞), U_i ∈ (−∞, ∞] are lower/upper bounds. It has been checked in [1] that ℓ1- and ℓ2-loss SVM are in the form of (I.1) and satisfy additional assumptions needed in [4]. We introduce an important class of gradient-based schemes for CD's variable selection: the Gauss-Southwell rule. It plays an importan...
Finding a Maximum Weight Sequence with Dependency Constraints
In this essay, we consider the following problem: We are given a graph and a weight associated with each vertex, and we want to choose a sequence of vertices that maximizes the sum of the weights, subject to some constraints arising from dependencies between vertices. We consider several versions of this problem with different constraints. These problems have applications in finding the converg...
Accelerating ISTA with an active set strategy
Starting from a practical implementation of Roth and Fisher’s algorithm to solve a Lasso-type problem, we propose and study the Active Set Iterative Shrinkage/Thresholding Algorithm (AS-ISTA). The convergence is proven by observing that the algorithm can be seen as a particular case of a coordinate gradient descent algorithm with a Gauss-Southwell-r rule. We provide experimental evidence that t...
Approximate Steepest Coordinate Descent
We propose a new selection rule for the coordinate selection in coordinate descent methods for huge-scale optimization. The efficiency of this novel scheme is provably better than the efficiency of uniformly random selection, and can reach the efficiency of steepest coordinate descent (SCD), enabling an acceleration of a factor of up to n, the number of coordinates. In many practical applicatio...